Introduction

According to the documentation (https://search.r-project.org/CRAN/refmans/spData/html/boston.html), this dataset contains housing data collected as part of the 1970 census of Boston, Massachusetts. The data frame has 506 rows and 20 columns and contains the corrected version of the Harrison and Rubinfeld (1978) data. Each observation (row) contains a collection of statistics for a single census ‘tract’ (a small geographic region containing multiple houses, defined specifically for a census). Note that MEDV is censored: median values at or over USD 50,000 are set to USD 50,000.

In this study we will consider the spatial distribution of the CMEDV variable. This variable corresponds to the median value (in USD 000s) of owner-occupied housing in each census tract. Each tract is also associated with a point location; geographic coordinates for this point (measured in decimal degrees latitude and longitude), as well as the town in which it is located (within the Greater Boston area), are provided for each observation.

We derive a smaller data frame from this data set that contains only the variables TOWN, LON, LAT and CMEDV.

Analysis of data

First, we read the boston data from the spData package in R and select the columns of interest. From Table 1, we can see that the variable TOWN is a factor with 92 levels. We also check for missing values; as the table of missing-value counts below shows, there are none in the data.
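
The first rows of the dataset, boston.c

| TOWN       | LON    | LAT   | CMEDV |
|------------|--------|-------|-------|
| Nahant     | -70.96 | 42.26 | 24.0  |
| Swampscott | -70.95 | 42.29 | 21.6  |
| Swampscott | -70.94 | 42.28 | 34.7  |
| Marblehead | -70.93 | 42.29 | 33.4  |
| Marblehead | -70.92 | 42.30 | 36.2  |

Number of missing values

| Variable | Missing values |
|----------|----------------|
| TOWN     | 0              |
| LON      | 0              |
| LAT      | 0              |
| CMEDV    | 0              |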
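
As a minimal sketch of this step, the snippet below rebuilds the five rows printed in the table above as a toy data frame and runs the same two checks (number of TOWN levels, missing values per column). It is only an illustration: in the actual analysis the input would presumably be the `boston.c` data frame from the spData package, and the object name `boston_small` is hypothetical.

```r
# Toy stand-in for the reduced data: the five rows printed in the table above.
# Assumption: the real analysis would start from spData's boston.c, e.g.
# boston.c[, c("TOWN", "LON", "LAT", "CMEDV")].
boston_small <- data.frame(
  TOWN  = factor(c("Nahant", "Swampscott", "Swampscott",
                   "Marblehead", "Marblehead")),
  LON   = c(-70.96, -70.95, -70.94, -70.93, -70.92),
  LAT   = c(42.26, 42.29, 42.28, 42.29, 42.30),
  CMEDV = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# Number of distinct towns (92 in the full data, fewer in this toy sample)
n_towns <- nlevels(boston_small$TOWN)
print(n_towns)

# Count of missing values in each column
n_missing <- colSums(is.na(boston_small))
print(n_missing)
```

Base R is used here so the sketch runs without extra packages; `dplyr::select()` would do the same column selection.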

Visualisation

We first plot the coordinates using ggplot. On its own, the plot below gives no geographic context, so we need to incorporate maps into our analysis.
Figure: Coordinates

In the map below, we can already see that the points representing the latitudes and longitudes do not match the towns.

## Assuming "lon" and "lat" are longitude and latitude, respectively
Figure: Coordinates on map

To be sure, we zoom in, and we can clearly see that some of the towns appear in the water.

## Assuming "lon" and "lat" are longitude and latitude, respectively
Figure: Coordinates on map

Finally, we inspect individual towns on Google Maps, in particular Cambridge. We look up the correct coordinates on Google Maps and add them to our map. The blue dot represents the correct coordinates, while the green dot shows the coordinates in our data. There is a clear difference between the two.

## Assuming "LON" and "LAT" are longitude and latitude, respectively

Figure: Zoom on map

Coordinates correction

In order to correct the data, we suppose that all coordinates are shifted by a certain amount. We assume that there are \(n_j\) observations in town \(j\), and for each observation \(k\) in town \(j\), we denote the longitudinal coordinate as \(x_{j,k}\), \(k = 1, \dots, n_j\). Then we assume:

\[ x_{j,k} = TC^{(x)}_j + \Delta^{(x)}_{j,k} \]

where \(TC^{(x)}_j\) is the longitudinal coordinate of the center of town \(j\), and \(\Delta^{(x)}_{j,k}\) is the displacement of observation \(k\) in town \(j\) from the town center. We also assume that the latitudinal coordinates (which we denote \(y_{j,k}\)) satisfy a similar relationship. The suggested systematic error is therefore that \((TC^{(x)}_j, TC^{(y)}_j)\) has been misspecified for \(j = 1, \dots, n\), where \(n\) is the number of towns.

To find the displacement, we use the correct center coordinates for each town in Boston, which are provided in the file BostonTownCentres.csv. First we take a quick look at the data. Note that the town names in this file are of type character.

## Rows: 92 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): town
## dbl (2): lat, lon
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Correct coordinates for each town in Boston

| town      | lat      | lon       |
|-----------|----------|-----------|
| Arlington | 42.41537 | -71.15644 |
| Ashland   | 42.26066 | -71.46413 |
| Bedford   | 42.49173 | -71.28179 |
| Belmont   | 42.39593 | -71.17867 |
| Beverly   | 42.55843 | -70.88005 |

Next, we use an appropriate mutating join to combine the two data sets. We check and observe that the number of rows in boston.c does not match the number of rows in the new data frame. We find that the missing data corresponds to Saugus, which is spelled as Sargus in boston.c. As a result, we correct the instances of Sargus and join the corrected data frame with BostonTownCentres. This time the row counts match.

## [1] FALSE
## [1] "Sargus"
## [1] TRUE
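
The check-and-fix sequence described above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the original code: the CMEDV values and the Saugus centre coordinates are made up, the object names are hypothetical, and base R `merge()` stands in for the dplyr mutating join (`dplyr::inner_join()` or `dplyr::left_join()` would be the closer equivalents).

```r
# Toy observations; "Sargus" mimics the misspelling found in boston.c.
# CMEDV values are made up for illustration.
obs <- data.frame(
  TOWN  = factor(c("Arlington", "Arlington", "Sargus", "Belmont")),
  CMEDV = c(25.0, 27.5, 20.1, 35.4)
)

# Two rows from the town-centre table above, plus an illustrative Saugus row
# (the Saugus coordinates are placeholders, not taken from the file).
centres <- data.frame(
  town = c("Arlington", "Belmont", "Saugus"),
  lat  = c(42.41537, 42.39593, 42.46),
  lon  = c(-71.15644, -71.17867, -71.01),
  stringsAsFactors = FALSE
)

# An inner join drops observations whose town has no match,
# so the row counts disagree.
joined_bad <- merge(obs, centres, by.x = "TOWN", by.y = "town")
print(nrow(joined_bad) == nrow(obs))

# Towns present in the observations but absent from the centre table
unmatched <- setdiff(levels(obs$TOWN), centres$town)
print(unmatched)

# Correct the spelling at the factor-level, then join again
levels(obs$TOWN)[levels(obs$TOWN) == "Sargus"] <- "Saugus"
joined <- merge(obs, centres, by.x = "TOWN", by.y = "town")
print(nrow(joined) == nrow(obs))
```

Renaming the factor level changes every occurrence at once, which avoids converting TOWN to character and back.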

Next, we visualise the correct coordinates. We can already observe that there are no points on the water and that they seem to match the towns on the map.

## Assuming "lon" and "lat" are longitude and latitude, respectively
Figure: Correct coordinates on map

We zoom into an area to check that everything is in order.

## Assuming "lon" and "lat" are longitude and latitude, respectively

Figure: Zoom on correct coordinates on map

Correct coordinates

In order to fix our data set, we need to replace the centroid of the \(n_j\) boston.c locations in each town (i.e. for \(j = 1, \dots, n\)) with the true town center. First, we find the centroid in our dataset by grouping the data by town and taking the mean longitude and latitude. Then we calculate the displacement as follows:

\[ x_{j,k} = TC^{(x)}_j + \Delta^{(x)}_{j,k} \Rightarrow \Delta^{(x)}_{j,k} = x_{j,k} - TC^{(x)}_j \]

In the equation above, \(x_{j,k}\) is known and equal to the longitude recorded in boston.c, and \(TC^{(x)}_j\) is the mean longitude of town \(j\) calculated above; the latitudes are treated in the same way. Next, we add the displacement of each observation to the true town center contained in BostonTownCentres.csv, which gives two new columns with the true coordinates for each observation. We add these columns to the combined data frame above.
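
A minimal sketch of this correction on toy data is given below. All numbers are made up, and the column names `LON_true` and `LAT_true` are hypothetical; the layout assumes a joined data frame in which `LON`/`LAT` hold the recorded coordinates and `lon`/`lat` hold the true town centre.

```r
# Toy joined data: LON/LAT are the recorded (shifted) coordinates,
# lon/lat are the true town centres. All values are made up.
joined <- data.frame(
  TOWN = c("A", "A", "A", "B", "B"),
  LON  = c(-70.96, -70.94, -70.95, -70.80, -70.82),
  LAT  = c(42.26, 42.28, 42.27, 42.40, 42.42),
  lon  = c(-71.05, -71.05, -71.05, -70.90, -70.90),
  lat  = c(42.35, 42.35, 42.35, 42.50, 42.50),
  stringsAsFactors = FALSE
)

# Per-town centroid of the recorded coordinates; ave() returns the group
# mean repeated for every row of the group.
cen_lon <- ave(joined$LON, joined$TOWN)
cen_lat <- ave(joined$LAT, joined$TOWN)

# Displacement of each observation from its town centroid
delta_lon <- joined$LON - cen_lon
delta_lat <- joined$LAT - cen_lat

# Corrected coordinates: true town centre plus displacement
joined$LON_true <- joined$lon + delta_lon
joined$LAT_true <- joined$lat + delta_lat

print(joined[, c("TOWN", "LON_true", "LAT_true")])
```

Because the displacements within a town average to zero, the centroid of the corrected points in each town coincides with the true town centre, which is a convenient sanity check.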

Figure: Final maps

Visualisation

Finally, we construct a visualisation that shows the spatial distribution of the median value of owner-occupied housing in Greater Boston in 1970. In this instance, we use ggmap. We observe that some towns have only one observation, so we cannot create polygons for them.
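
One way to identify the towns with a single observation is to tabulate the town names, as in the sketch below. It uses the town column of the five rows shown earlier as a toy input, so the result applies to that sample only; the variable names are hypothetical.

```r
# Toy input: the TOWN values from the first five rows of the data
towns <- c("Nahant", "Swampscott", "Swampscott", "Marblehead", "Marblehead")

# Number of observations per town
counts <- table(towns)

# Towns that appear exactly once cannot be drawn as polygons
single_obs_towns <- names(counts)[counts == 1]
print(single_obs_towns)
```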

## Source : https://maps.googleapis.com/maps/api/staticmap?center=42.36008,-71.05888&zoom=10&size=640x640&scale=2&maptype=terrain&key=xxx-0NQyKizPR9jdAYCfTiyB5IhVfbdU2xI

Choropleth map

We use the elect80 data set, which contains the 1980 Presidential election results for 3,107 US counties together with their geographical coordinates. First, we use the FIPS codes to find the state and county of each observation, using county.fips, a database in the maps package that matches FIPS codes to county and state names. After matching the two data frames, we drop the remaining columns and keep the region and pc_turnout variables. To plot the outlines of a geographical region, we use ggplot2::map_data(), which extracts coordinate data from the maps library to create a data frame containing the boundaries of one of a selection of geographical regions. Once we have the coordinates for the boundaries of our spatial regions, we can match them to the values of our spatial variable of interest using one of the ‘mutating joins’ from the dplyr library.
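
The matching steps can be sketched on toy data as follows. Everything here is a stand-in: `fips_lookup` imitates the `fips`/`polyname` layout of `maps::county.fips`, `boundaries` imitates the `long`/`lat`/`region`/`subregion` columns returned by `ggplot2::map_data("county")`, the turnout values and boundary points are made up, and storing FIPS as zero-padded strings is an assumption about elect80. Base R `merge()` is used in place of the dplyr joins so the sketch runs without extra packages.

```r
# Toy stand-in for maps::county.fips: integer FIPS code and "state,county" name
fips_lookup <- data.frame(
  fips     = c(1001L, 1003L, 25025L),
  polyname = c("alabama,autauga", "alabama,baldwin", "massachusetts,suffolk"),
  stringsAsFactors = FALSE
)

# Toy election data; pc_turnout values are made up
turnout <- data.frame(
  FIPS       = c("01001", "01003", "25025"),
  pc_turnout = c(0.59, 0.62, 0.55),
  stringsAsFactors = FALSE
)

# Convert the zero-padded codes to integers so they match the lookup
turnout$fips <- as.integer(as.character(turnout$FIPS))
matched <- merge(turnout, fips_lookup, by = "fips")

# Split "state,county" into the region / subregion names used by the map data
parts <- strsplit(matched$polyname, ",", fixed = TRUE)
matched$region    <- vapply(parts, `[`, character(1), 1)
matched$subregion <- vapply(parts, `[`, character(1), 2)

# Toy boundary points shaped like map_data("county") output
boundaries <- data.frame(
  long      = c(-86.5, -86.4, -87.7, -71.1),
  lat       = c(32.3, 32.4, 30.5, 42.3),
  region    = c("alabama", "alabama", "alabama", "massachusetts"),
  subregion = c("autauga", "autauga", "baldwin", "suffolk"),
  stringsAsFactors = FALSE
)

# Attach the turnout value to every boundary point of the matching county
choropleth_df <- merge(
  boundaries,
  matched[, c("region", "subregion", "pc_turnout")],
  by = c("region", "subregion")
)
print(choropleth_df)
```

With real data, the resulting data frame would typically be passed to `ggplot2::geom_polygon()` with `fill = pc_turnout` to draw the choropleth.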